
FIGURE 4.14
The loss landscape illustration of the supernet. (a) The gradient of the current weights with different $\alpha$; (b) the vanilla $\alpha_{t+1}$ obtained with backpropagation; (c) $\tilde{\alpha}_{t+1}$ obtained with the decoupled optimization.

$$a^{(j)} = \sum_{i<j} \mathrm{softmax}\big(\alpha^{(i,j)}_{m}\big)\,\big(w^{(i,j)} \otimes a^{(i)}\big),$$

where $w^{(i,j)} = [[w_m]] \in \mathbb{R}^{M \times 1}$, $w_m \in \mathbb{R}^{C_{out} \times C_{in} \times K_m \times K_m}$ denotes the weights of all candidate operations between the $i$-th and $j$-th nodes, and $K_m$ denotes the kernel size of the $m$-th operation. Specifically, for the pooling and identity operations, $K_m$ equals the downsample size and the size of the feature map, and $w_m$ equals $1/(K_m \times K_m)$ and $1$, respectively. For each intermediate node, its output $a^{(j)}$ is jointly determined by $\alpha^{(i,j)}_m$ and $w^{(i,j)}_m$, while $a^{(i)}$ is independent of both $\alpha^{(i,j)}_m$ and $w^{(i,j)}_m$. As shown in Figs. 4.14(a) and (b), the gradient of the corresponding $w$ varies with different $\alpha$ and is sometimes difficult to optimize, possibly becoming trapped in local minima. However, by decoupling $\alpha$ and $w$, the supernet can escape such local minima and be optimized with better convergence.
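To make the mixed edge computation above concrete, the following is a minimal PyTorch-style sketch (our illustration under simplifying assumptions, not the implementation used here): every candidate operation on edge $(i, j)$ is treated as a convolution with kernel $w_m$ applied to $a^{(i)}$, and the results are mixed with $\mathrm{softmax}(\alpha^{(i,j)})$; all function and variable names are placeholders.

```python
import torch
import torch.nn.functional as F

def edge_output(a_i, alpha_ij, candidate_weights):
    """Mix the M candidate operations on edge (i, j).

    a_i:               input feature map a^{(i)}, shape (N, C_in, H, W)
    alpha_ij:          architecture parameters alpha^{(i,j)}, shape (M,)
    candidate_weights: list of M kernels w_m, each of shape (C_out, C_in, K_m, K_m)
    """
    probs = F.softmax(alpha_ij, dim=0)                       # softmax over the M candidates
    outs = [F.conv2d(a_i, w_m, padding=w_m.shape[-1] // 2)   # w_m convolved with a^{(i)}
            for w_m in candidate_weights]
    return sum(p * o for p, o in zip(probs, outs))           # softmax-weighted mixture

def node_output(preds, alphas, weights):
    """a^{(j)}: sum of the mixed edge outputs over all predecessor nodes i < j."""
    return sum(edge_output(a_i, alpha_ij, w_ij)
               for a_i, alpha_ij, w_ij in zip(preds, alphas, weights))
```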

Based on the derivation and analysis above, we propose our objective for optimizing the neural architecture search process:

$$\arg\min_{\alpha, w} L(w, \alpha) =
\begin{cases}
L_{\mathrm{NAS}} + \mathrm{reg}(w), & \text{for the Parent model,} \\
L_{\mathrm{DCP\text{-}NAS}} + \mathrm{reg}(w), & \text{for the Child model,}
\end{cases}
\qquad (4.42)$$

where $\alpha \in \mathbb{R}^{E \times M}$, $w \in \mathbb{R}^{M \times 1}$, and $\mathrm{reg}(\cdot)$ denotes the regularization term. Following [151, 265], the weights $w$ and the architecture parameters $\alpha$ are optimized sequentially, with $w$ and $\alpha$ updated independently. However, optimizing $w$ and $\alpha$ independently is improper because of their coupling relationship. We therefore regard the searching and training process of differentiable Child-Parent neural architecture search as a coupled optimization problem and solve it with a new backtracking method. Details are given in Section 4.4.6.
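As a point of reference for the discussion above, the following is a minimal sketch of the sequential scheme of [151, 265], in which $w$ and $\alpha$ are updated independently under the case-wise objective of Eq. (4.42); the names (`objective`, `loss_nas`, `loss_dcp_nas`, `reg`, the two optimizers) are placeholders of ours, not the authors' API.

```python
def objective(model, batch, is_parent, loss_nas, loss_dcp_nas, reg):
    """Eq. (4.42): L_NAS + reg(w) for the Parent model, L_DCP-NAS + reg(w) for the Child."""
    base = loss_nas(model, batch) if is_parent else loss_dcp_nas(model, batch)
    return base + reg(model)

def sequential_search_step(model, batch, is_parent,
                           w_optimizer, alpha_optimizer,
                           loss_nas, loss_dcp_nas, reg):
    """One step of the sequential scheme: w and alpha are updated independently."""
    # Step 1: update the weights w; only w is registered in w_optimizer,
    # so alpha is left unchanged by this step.
    w_optimizer.zero_grad()
    objective(model, batch, is_parent, loss_nas, loss_dcp_nas, reg).backward()
    w_optimizer.step()

    # Step 2: update the architecture parameters alpha; only alpha is
    # registered in alpha_optimizer, so w is left unchanged.
    alpha_optimizer.zero_grad()
    objective(model, batch, is_parent, loss_nas, loss_dcp_nas, reg).backward()
    alpha_optimizer.step()
```

The backtracking method introduced next replaces the independent update of $\alpha$ in Step 2 with the coupled update of Eq. (4.43).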

Decoupled Optimization for Child-Parent Model From a new perspective, we reconsider the coupling relation between $w$ and $\alpha$: the derivative calculation should take the coupling between $w$ and $\alpha$ into account. Based on the chain rule [187] and its notation, we have the following:

$$\begin{aligned}
\tilde{\alpha}_{t+1} &= \alpha_t + \eta_1\left(\frac{\partial L(\alpha_t, w_t)}{\partial \alpha_t} + \eta_2\,\mathrm{Tr}\!\left[\left(\frac{\partial L(\alpha_t, w_t)}{\partial w_t}\right)^{T}\frac{\partial w_t}{\partial \alpha_t}\right]\right) \\
&= \alpha_{t+1} + \eta_1\eta_2\,\mathrm{Tr}\!\left[\left(\frac{\partial L(\alpha_t, w_t)}{\partial w_t}\right)^{T}\frac{\partial w_t}{\partial \alpha_t}\right],
\end{aligned} \qquad (4.43)$$

where $\eta_1$ represents the learning rate, $\eta_2$ represents the backtracking coefficient, and $\tilde{\alpha}_{t+1}$ denotes the value obtained after backtracking from the vanilla $\alpha_{t+1}$. In contrast, the vanilla $\alpha_{t+1}$ is calculated from the backpropagation rule and the corresponding optimizer of the neural network.
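To illustrate how the coupling term $\mathrm{Tr}[(\partial L/\partial w_t)^{T}\,\partial w_t/\partial \alpha_t]$ in Eq. (4.43) can be evaluated with automatic differentiation, here is a minimal PyTorch sketch. It rests on our own simplifying assumption that the dependence of $w_t$ on $\alpha$ is modeled by a single unrolled weight-update step; the actual backtracking procedure is detailed in Section 4.4.6, and the names `loss_fn`, `lr_w`, `eta1`, and `eta2` are placeholders.

```python
import torch

def backtracked_alpha_update(alpha, w_prev, loss_fn, eta1, eta2, lr_w):
    """Compute alpha_tilde_{t+1} as in Eq. (4.43).

    alpha, w_prev: leaf tensors with requires_grad=True
    loss_fn(alpha, w): differentiable scalar loss L(alpha, w) that couples alpha and w
    """
    # Assumed coupling model: w_t(alpha) is one unrolled weight step taken under alpha.
    loss_prev = loss_fn(alpha, w_prev)
    (grad_w_prev,) = torch.autograd.grad(loss_prev, w_prev, create_graph=True)
    w_t = w_prev - lr_w * grad_w_prev        # w_t now depends on alpha through grad_w_prev

    # Partial derivative dL/dalpha with w_t treated as a constant.
    (grad_alpha,) = torch.autograd.grad(loss_fn(alpha, w_t.detach()), alpha)

    # Partial derivative dL/dw_t, then Tr[(dL/dw)^T dw/dalpha] as a vector-Jacobian product.
    (grad_w,) = torch.autograd.grad(loss_fn(alpha.detach(), w_t), w_t)
    (coupling,) = torch.autograd.grad(w_t, alpha, grad_outputs=grad_w)

    # Eq. (4.43): alpha_tilde_{t+1} = alpha_t + eta1 * (dL/dalpha + eta2 * Tr[...]).
    return alpha + eta1 * (grad_alpha + eta2 * coupling)
```

Because the trace is evaluated as a vector-Jacobian product, the full Jacobian $\partial w_t/\partial \alpha_t$ is never formed explicitly.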